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Abstract 

In this work we are interested in the problems of supervised learning and variable se- 
lection when the input-output dependence is described by a nonlinear function depending 
on a few variables. Our goal is to consider a sparse nonparametric model, hence avoid- 
ing linear or additive models. The key idea is to measure the importance of each variable 
in the model by making use of partial derivatives. Based on this intuition we propose 
a new notion of nonparametric sparsity and a corresponding least squares regularization 
scheme. Using concepts and results from the theory of reproducing kernel Hilbert spaces 
and proximal methods, we show that the proposed learning algorithm corresponds to a 
minimization problem which can be provably solved by an iterative procedure. The con- 
sistency properties of the obtained estimator are studied both in terms of prediction and 
selection performance. An extensive empirical analysis shows that the proposed method 
performs favorably with respect to the state-of-the-art methods. 

Keywords: Sparsity Nonparametrics, Variable selection, Regularization, Proximal meth- 
ods, RKHS 

1 Introduction 

It is now common to see practical applications, for example in bioinformatics and computer 
vision, where the dimensionality of the data is in the order of hundreds, thousands and even 
tens of thousands. It is known that learning in such a high dimensional regime is feasible only 
if the quantity to be estimated satisfies some regularity assumptions |24| . In particular, the 
idea behind, so called, sparsity is that the quantity of interest depends only on a few relevant 
variables (dimensions). In turn, this latter assumption is often at the basis of the construction 
of interpretable data models, since the relevant dimensions allow for a compact, hence inter- 
pretable, representation. An instance of the above situation is the problem of learning from 
samples a multivariate function which depends only on a (possibly small) subset of relevant 
variables. Detecting such variables is the problem of variable selection. 

Largely motivated by recent advances in compressed sensing [151 [251, the above problem 
has been extensively studied under the assumption that the function of interest (target function) 
depends linearly to the relevant variables. While a naive approach (trying all possible subsets of 



variables) would not be computationally feasible it is known that meaningful approximations 
can be found either by greedy methods 11531 , or convex relaxation {I 1 regularization a.k.a. basis 
pursuit or LASSO |52l[T7ll28ll ). In this context efficient algorithms (see |50ll39| and references 
therein) as well as theoretical guarantees are now available (see |Tl4| and references therein). In 
this paper we are interested into the situation where the target function depends non-linearly 
to the relevant variables. This latter situation is much less understood. Approaches in the 
literature are mostly restricted to additive models p3| . In such models the target function is 
assumed to be a sum of (non-linear) univariate functions. Solutions to the problem of vari- 
able selection in this class of models include [48J and are related to multiple kernel learning 
l8l . Higher order additive models can be further considered, encoding explicitly dependence 
among the variables - for example assuming the target function to be also sum of functions de- 
pending on couples, triplets etc. of variables, as in [38] and [7j. Though this approach provides 
a more interesting, while still interpretable, model, its size/ complexity is essentially more than 
exponential in the initial variables. Only a few works, that we discuss in details in Section |2j 
have considered notions of sparsity beyond additive models. 

In this paper, we propose a new approach based on the idea that the importance of a vari- 
able, while learning a non-linear functional relation, can be captured by the corresponding 
partial derivative. This observation suggests a way to define a new notion of nonparametric 
sparsity and a corresponding regularizer which favors functions where most partial deriva- 
tives are essentially zero. The question is how to make this intuition precise and how to derive 
a feasible computational learning scheme. The first observation is that, while we cannot mea- 
sure a partial derivative everywhere, we can do it at the training set points and hence design a 
data-dependent regularizer. In order to derive an actual algorithm we have to consider two 
further issues: How can we estimate reliably partial derivatives in high dimensions? How can 
we ensure that the data-driven penalty is sufficiently stable? The theory of reproducing kernel 
Hilbert spaces (RKHSs) provides us with tools to answer both questions. In fact, partial deriva- 
tives in a RKHS are bounded linear functionals and hence have a suitable representation that 
allows efficient computations. Moreover, the norm in the RKHS provides a natural further reg- 
ularizer ensuring stable behavior of the empirical, derivative based penalty. Our contribution 
is threefold. First, we propose a new notion of sparsity and discuss a corresponding regular- 
ization scheme using concept from the theory of reproducing kernel Hilbert spaces. Second, 
since the proposed algorithm corresponds to the minimization of a convex, but not differen- 
tiable functional, we develop a suitable optimization procedure relying on forward-backward 
splitting and proximal methods. Third, we study properties of the proposed methods both 
in theory, in terms of statistical consistency, and in practice, by means of an extensive set of 
experiments. 

Some preliminary results have appeared in a short conference version of this paper 11491 . 
With respect to the conferecen version, the current version contains: the detailed discussion of 
the derivation of the algorithm with all the proofs, the consistency results of Section 4, an aug- 
mented set of experiments and several further discussions. The paper is organized as follows. 
In section|3]we discuss our approach and present the main results in the paper. In Section|4]we 
discuss the computational aspects of the method. In Section|5]we prove consistency results. In 
Section [6] we provide an extensive empirical analysis. Finally in Section [7] we conclude with a 
summary of our study and a discussion of future work. 
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2 Problem Setting and Previous Work 



Given a training set z n = (x, y) = (xj,?/i)f=i of input output pairs, with Xi G X C M d and 
i/i 6 y C t, we are interested into learning about the functional relationship between input 
and output. More precisely, in statistical learning the data are assumed to be sampled identi- 
cally and independently from a probability measure p on X x y so that if we measure the error 
by the square loss function, the regression function f p (x) = f ydp(x, y) minimizes the expected 
risk£(/) = f(y-f(x)fdp( Xl y). 

Finding an estimator / of f p from finite data is possible, if f p satisfies some suitable prior as- 
sumption [24 J. In this paper we are interested in the case where the regression function is sparse 
in the sense that it depends only on a subset R p of the possible d variables. Estimating the set 
R p of relevant variables is the problem of variable selection. 

Linear and additive models The sparsity requirement can be made precise considering linear 
functions f(x) = Ylt=i Pa,x a with x = (x 1 , . . . , x d ). In this case the sparsity of a function is 
quantified by the so called zero-norm ^o(/) = #{a = 1 ; ■ • • > d | /3 a ^ 0}- The zero norm, while 
natural for variable selection, does not lead to efficient algorithms and is often replaced by the 
i 1 norm, that is f2i(/) = Ylt=i I A* I- This approach has been studied extensively and is now 
fairly well understood, see |fl4| and references therein. Regularization with I 1 regularizers, 
obtained by minimizing 

1 n 

8{f) + Afii(/), £(f) = - - f( Xl )f, 

i=i 

can be solved efficiently and, under suitable conditions, provides a solution close to that of the 
zero-norm regularization. 

The above scenario can be generalized to additive models f(x) = J2t=i fa{x a ), where f a 
are univariate functions in some (reproducing kernel) Hilbert spaces % a , a = 1 , . . . , d. In this 
case the analogous of the zero-norm and the l l norm are ^o(/) = #{a £ {1, . . . , d} : ||/ a || / 0} 
and Sli (/) = Yla=i II /a II' respectively. This latter setting, related to multiple kernel learning 
(HI IU, has been considered for example in 11481 , see also II36II and references therein. Consider- 
ing additive models limits the way in which the variables can interact. This can be partially 
alleviated considering higher order terms in the model as it is done in ANOVA decomposi- 
tion [ 58"ll3~Tf . More precisely, we can add to the simplest additive model functions of couples 
f a ,b(x a ,x b ), triplets f a ,b,c(x a , x b , x c ), etc. of variables - see |38 | . For example one can consider 
functions of the form f(x) = ^2a=i fa(x a ) + ^2 a< b fa,b(x a , x b ). In this case the analogous to the 
zero and I 1 norms are fi (/) = #{a = 1, . . . , d : ||/ a || ± 0} + #{(a, b) : a < b, ||/ 6)C || ^ 0} 
and f2i(/) = J2t=i \\fa\\ + Y2 a <b ll/a,&||/ respectively. Note that in this case sparsity will not be 
in general with respect to the original variables but rather with respect to the elements in the 
additive model. Clearly, while this approach provides a more interesting and yet interpretable 
model, its size /complexity is essentially more than exponential in the number of variables. 
Some proposed attempts to tackle this problem are based on restricting the set of allowed spar- 
sity patterns and can be found in 0. 

2.1 Nonparametric approaches 

The above discussion naturally raises the question: 

What if we are interested into learning and performing variable selection when the functions of interest 
are not described by an additive model? 
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Few papers have considered this question. Here we discuss in some more details ff37l Il3l 
Miller and Hall(2010) |, \20\, to which we also refer for further references. 



The first three papers fl37llT3l Miller an d Hall(2010) ] follow similar approaches focusing on the 



point-wise estimation of the regression function and of the relevant variables. The basic idea is 
to start from a locally linear (or polynomial) point wise estimator f n (x) at a point x obtained 
from the minimizer of 

1 n 

~y^(yi- {w,Xi- x) Rd ) 2 K H (xi- x) (1) 

i=l 

where Kh is a localizing window function depending on a matrix (or a vector) H of smoothing 
parameters. Different techniques are used to (locally) select variables. In the RODEO algo- 
rithm [37], the localizing window function depends on one smoothing parameter per variable 
and the partial derivative of the local estimator with respect to the smoothing parameter is 
used to select variables. In l|T3l , selection is considering a local lasso, that is an l\ to the local 
empirical risk functional ([TJ. In the LABAVS algorithm discussed in [Mill er and Hall(2010)[ 
several variable selection criterion are discussed including the local lasso, hard thresholding, 
and backward step wise approach. The above approaches typically leads to cumbersome com- 
putations and do not scale well with the dimensionality of the space and with the number of 
relevant variables. 

Indeed, in all the above works the emphasis is in the theoretical analysis quantifying the estima- 
tion errors of the proposed methods. It is shown in 11371 that the RODEO algorithm is a nearly 
optimal pointwise estimator of the regression function, under assumption on the marginal 
distribution and the regression functions. These results are further improved in |T3"fl where 
optimal rates are derived under milder assumptions and sparsistency (the recovery of R p ) is 
also studied. Uniform error estimates are derived in [Miller and Hall(2010)| (see Section 2.6 in 
| Miller an d Hall(2010)| for further discussions and comparison). More recently, an estimator 



based on the comparison of some well chosen empirical Fourier coefficients to a prescribed sig- 
nificance level is described and studied in 11201 where a careful statistical analysis is proposed 
considering different regimes for n, d and d* , where d* is the cardinality of R p . Finally, in a 
slightly different context, [23| studies the related problem of determining the number of func- 
tion values at adaptively chosen points that are needed in order to correctly estimate the set of 
globally relevant variables. 

3 Sparsity Beyond linear Models 

In this section we present our approach and summarize our main contributions. 



3.1 Sparsity and Regularization using Partial Derivatives 

Our study starts from the observation that, if a function / is differentiable, the relative impor- 
tance of a variable at a point x can be captured by the magnitude of the corresponding partial 
derivative^ 

df_ 
dx a 

This observation can be developed into a new notion of sparsity and corresponding regulariza- 
tion scheme that we study in the rest of the paper. We note, that tegularization using derivatives 

order for the partial derivatives to be defined at all points we always assume that the closure of X coincides 
with the closure of its interior. 
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is not new. Indeed, the classical splines (Sobolev spaces) regularization [57], as well as more 
modern techniques such as manifold regularization l|T2) use derivatives to measure the regular- 
ity of a function. Similarly total variation regularization utilizes derivatives to define regular 
function. None of the above methods though allows to capture a notion of sparsity suitable 
both for learning and variable selection- see Remark [T] 

Using partial derivatives to define a new notion of a sparsity and design a regularizer for 
learning and variable selection requires considering the following two issues. First, we need 
to quantify the relevance of a variable beyond a single input point to define a proper (global) 
notion of sparsity. If the partial derivative is continuous]^] then a natural idea is to consider 



df 



dx° 



PX 




(2) 



where px is the marginal probability measure of p on X. While considering other L p norms is 
possible, in this paper we restrict our attention to L 2 . A notion of nonparametric sparsity for a 
smooth, non-linear function / is captured by the following functional 



«o (/) = #{a = l, 



df 



dx° 



7^0 



(3) 



PX 



and the corresponding relaxation is 



n?(/) = E 



df 



dx a 



PX 



The above functionals encode the notion of sparsity that we are going to consider. While for 
linear models, the above definition subsumes the classic notion of sparsity, the above definition 
is non constrained to any (parametric) additive model. 

Second, since px is only known through the training set, to obtain a practical algorithm we 
start by replacing the L 2 norm with an empirical version 



df 



dx a 



1 n 

n ^ 

i=l 



dfjxj] 
dx a 



and by replacing Q by the data-driven regularizer, 



fif (/) = E 



a=l 



df 



dx a 



(4) 



While the above quantity is a natural estimate of in practice it might not be sufficiently 
stable to ensure good function estimates where data are poorly sampled. In the same spirit of 
manifold regularization [12j, we then propose to further consider functions in a reproducing 
kernel Hilbert space (RKHS) defined by a differentiable kernel and use the penalty, 



nf(/) 



2 

Hi 



where v is a small positive number. The latter terms ensures stability while making the regu- 
larizer strongly convex. This latter property is a key for well-posedeness and generalization, as 



2 In the following, see Remark [2] we will see that further appropriate regularity properties on / are needed 
depending on whether the support of px is connected or not. 
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Figure 1: Difference between I 1 /fr and i 1 /£ 2 norm for binary matrices (white = 1, black=0), 
where in the latter case the t 1 norm is taken over the rows (variables) and the t 2 norm over the 
columns (samples). The two matrices have the same number of nonzero entries, and thus the 
same i 1 /i 1 norm, but the value of the i 1 /I 2 norm is smaller for the matrix on the right, where 
the nonzero entries are positioned to fill a subset of the rows. The situation on the right is thus 
favored by i 1 /I 2 regularization. 





we discuss in Section |5j As we will see in the following, RKHS will also be a key tool allowing 
computations of partial derivative of potentially high dimensional functions. 
The final learning algorithm is given by the minimization of the functional 



1 n / d 

i£(B-/(^)) 2 + r £ 

i=l \a=l 



Of 



dx a 



+ v 



(5) 



The remainder of the paper is devoted to the analysis of the above regularization algorithm. 
Before summarizing our main results we add two remarks. 

Remark 1 (Comparison with Derivative Based Regulrizers). It is perhaps useful to remark the 
difference between the regularizer we propose and other derivative based regularizes. We start by con- 
sidering 

2 -i n d /OJ ./ > N 2 



E 

a=l 



Of 



dx a 



1 



n 



EE 

i=l a=l 



dfjxj 
dx a 



1 



n 



£liv/(x.) 



where Vf(x) is the gradient off at x. This is essentially a data-dependent version of the classical penalty 
in Sobolev spaces which writes f \\V f(x)\\ 2 dx, where the uniform (Lebesgue) measure is considered. It 
is well known that while this regularizer measure the smoothness it does not yield any sparsity property. 

A different derivative based regularizer is given by \ Ya=i Sf=i d dx^ • Though this penalty (which 
we call I 1 /i 1 ) favors sparsity, it only forces partial derivative at points to be zero. In comparison the reg- 
ularizer we propose is of the I 1 /I 2 type and utilizes the square root to "group" the values of each partial 
derivative at different points hence favoring functions for which each partial derivative is small at most 
points. The difference between penalties is illustrated in Figure^ Finally note that we can also con- 
sider i Ya=i \\^f( x i)\\ - This regularizer, which is akin to the total variation regularizer f \\X7f(x)\\dx, 
groups the partial derivatives differently and favors functions with localized singularities rather than 
selecting variables. 

Remark 2. As it is clear from the previous discussion, we quantify the importance of a variable based 
on the norm of the corresponding partial derivative. This approach makes sense only if 



I — I 
dx a 



PX 



/ is constant with respect to x a . 



(6) 



The previous fact holds trivially if we assume the function f to be continuously differentiable (so that the 
derivative is pointwise defined, and is a continuous function) and supppx to be connected. If the latter 



assumption is not satisfied the situation is more complicated, as the following example shows. Suppose 
that px is the uniform distribution on the disjoint intervals [—2,-1] and [1,2], and y = {—1,1}. 
Moreover assume that p(y\x) = 6-i, if x G [—2,-1] and p(y\x) = Si, if x e [1,2]. Then, if we 
consider the regression function 



we get that f'(x) = on the support of px, although the variable x is relevant. To avoid such pathological 
situations when supppx is not connected in R d we need to impose more stringent regularity assumptions 
that basically imply that a function which is constant on a open interval is constant everywhere. This 
is verified when f belongs to the RKHS defined by a polynomial kernel, or, more generally, an analytic 
kernel such as the Gaussian kernel. 

3.2 Main Results 

We summarize our main contributions. 

1. Our main contribution is the analysis of the minimization of ^ and the derivation of a 
provably convergent iterative optimization procedure. We begin by extending the repre- 
senter theorem [57| and show that the minimizer of ^ has the finite dimensional repre- 



with a, (Pai)i=i ^ ^ n for all a = l,...,d. Then, we show that the coefficients in the 
expansion can be computed using forwards-backward splitting and proximal methods 
|[T8l l9ll. More precisely we present a fast forward-backward splitting algorithm, in which 
the proximity operator does not admit a closed form and is thus computed in an ap- 
proximated way. Using recent results for proximal methods with approximate proximity 
operators, we are able to prove convergence (and convergence rates) for the overall pro- 
cedure. The resulting algorithm requires only matrix multiplications and thresholding 
operations and is in terms of the coefficients a and (3 and matrices given by the kernel 
and its first and second derivatives evaluated at the training set points. 

2. We study the consistency properties of the obtained estimator. We prove that, if the kernel 
we use is universal, then there exists a choice of r = r n depending on n such that the 
algorithm is universally consistent II5T1 , that is 



for all e > 0. Moreover, we study the selection properties of the algorithm and prove that, 
if R p is the set of relevant variables and R Tn the set estimated by our algorithm, then the 
following consistency result holds 




sentation 




lim V(£{h)-£{f p )>e) = 
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Figure 2: Comparison of predictions for a radial function of 2 out of 20 variables (the 18 
irrelevant variables are not shown in the figure). In the upper left plot is depicted the value 
of the function on the test points (left), the noisy training points (center), the values predicted 
for the test points by our method (DENOVAS) (right). The bottom plots represent the values 
predicted for the test points by state-of-the-art algorithms based on additive models. Left: 
Multiple kernel learning based on additive models using kernels. Center: COSSO, which is a 
higher order additive model based on ANOVA decomposition |38|. Right: Hierarchical kernel 
learning Q. 



Test points Training points DENOVAS 




3. Finally we provide an extensive empirical analysis both on simulated and benchmark 
data, showing that the proposed algorithm (DENOVAS) compares favorably and often 
outperforms other algorithms. This is particularly evident when the function to be esti- 
mated is highly non linear. The proposed method can take advantage of working in a rich, 
possibly infinite dimensional, hypotheses space given by a RKHS, to obtain better estima- 
tion and selection properties. This is illustrated in Figure |2j where the regression function 
is a nonlinear function of 2 of 20 possible input variables. With 100 training samples the 
algorithms we propose is the only one able to correctly solve the problem among different 
linear and non linear additive models. On real data our method outperforms other meth- 
ods on several data sets. In most cases, the performance of our method and regularized 
least squares (RLS) are similar. However our method brings higher interpretability since 
it is able to select a smaller subset of relevant variable, while the estimator provided by 
RLS depends on all variables. 



4 Computational Analysis 

In this section we study the minimization of the functional (pj. 
4.1 Basic Assumptions 

We first begin by listing some basic conditions that we assume to hold throughout the paper. 

We let p be a probability measure on X x y with X C R d and y C R. A training set 
z n = (x, y) = (xi, yi)f = i is a sample from p n . We consider a reproducing kernel K : X x X — > R 
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Pt and the associated reproducing kernel Hilbert space. We assume p and K to satisfy the 
following assumptions. 

[Al] There exists K\ < oo such that sup^g^ \\t h> k(x, t)\\u < n\. 

[A2] The kernel k is C 2 (X x X) and there exists k-i < oo such that for all a = 1, . . . ,d we have 

\h < k 2 - 



«nn \\t i-A 9k (s,x) 

su Pxex II 1 ^ a s o 



[A3] T/jere exz'sfs M < oo such that y C [-M, M]. 
4.2 Computing the regularized solution 

We start our analysis discussing how to compute efficiently a regularized solution of the func- 
tional 

i=l 

where f2f(/) is defined in Q. We start observing that the term ||/||^ makes the above func- 
tional coercive and strongly convex with modulus^ rz//2, so that standard results ([29j) ensures 
existence and uniqueness of a minimizer f T , for any v > 0. 

The rest of this section is divided into two parts. First we show how the theory of RKHS 
(TJ allows to compute derivatives of functions on high dimensional spaces and also to derive 
a new representer theorem that allows to deal with finite dimensional minimization problems. 
Second we discuss how to apply proximal methods |Tl8l |9l to derive an iterative optimization 
procedure for which we can prove convergence. It is possible to see that the solution of Problem 
§7§ can be written as 

n ^ n d 1 

= E + E E -MW Xi (x), (8) 

i=l i=l a=l 

where a, {P a ,i)f = i G R n for all a = l,...,dk x is the function t M> k(x, t), and (d a k) x denotes par- 
tial derivatives of the kernel, see |20) . The main outcome of our analysis is that the coefficients 
a and /3 can be provably computed through an iterative procedure. To describe the algorithm 
we need some notation. For all a, b = 1, . . . , d, we define the n x n matrices K, Z a , L a ^ as 

Kjj = —k(xi,Xj), (9) 



[Z a 

and 



1 dk(s,Xj) 
JJ ~~ n ds a 



1 d k(x,s) 
n 



(10) 



[Mi,; - dx a ds b 



3 We say that a function £ : T-L R U {+00} is: 

• coercive if ]jm\\f\\-+ +00 £(f)/\\f\\ = +oo; 

• strongly convex of modulus fi if £(tf + (1 - t)g) < t£(f) + (1 - t)£(g) - f t(l -t)\\f- g\\ 2 for all t G [0, 1] 
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for alii, j = l,...,n. Clearly the above quantities can be easily computed as soon as we have 
an explicit expression of the kernel, see Example[l]in Appendix|A] We introduce also the nxnd 
matrices 

Z = (Zi, . . . , Z d ) 

L a = (L a ,i,...,L M ) Va = l, ...,d (11) 

and the nd x nd matrix 



Li,i • • • Li * \ ( L a 



L 



Ld,i ■ ■ ■ L d)d J \L d 
Denote with B n the unitary ball in W 1 , 

B n = {v G R n \\\v\\ n < 1}. (12) 

The coefficients in <|8j are obtained through Algorithm [lj where (3 is considered as a nd column 
vector = (/3 h i, P hn , f3 dA , P dtH ) T . 

Algorithm 1 

Given: parameters r, v > and step-sizes u, r? > 
Initialize: a = a 1 = 0, P° = /3 1 = 0, si = 1, u° = 0, t = 1 
while convergence not reached do 
t = t + 1 

1 



*=2(1 + V1 + 4 S ?_ 1 ) (13) 



3* = ( 1 + a*- 1 + 1^V~ 2 , ^ = f 1 + *=±=±) ^ + 1^1^ 

St J s t \ s t J s t 



(14) 



o' [ 1 - ™) a* - ^ (Ka* + - y ) (15) 



set t>° = f* 1 , q = 

while convergence not reached do 
g = g + l 

for a = 1, . . . d do 



end for 
end while 

set v f = v q 



end while 
return (a t ,P t ) 



V a = 7Tx Bn (vf 1 " ^ (Kv"- 1 ~ (ZW + (l - T -P) La/3*))) (16) 



0t = l 1 _™)p_ f) \ ( 17 ) 



a 



The proposed optimization algorithm consists of two nested iterations, and involves only 
matrix multiplications and thresholding operations. Before describing its derivation and dis- 
cussing its convergence properties, we add three remarks. First, the proposed procedure re- 
quires the choice of an appropriate stopping rule, which will be discussed later, and of the step 
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sizes a and 77. The simple a priori choice a = ||K|| + rzv, 77 = ||L|| ensures convergence, as dis- 



cussed in the Subsection 4.5 and is the one used in our experiments. Second, the computation 



of the solution for different regularization parameters can be highly accelerated by a simple 



warm starting procedure, as the one in 1132 1 . Finally, in Subsection 4.6 we discuss a principled 
way to select variable using the norm of the coefficients (v t a ) d 



1 a- 



4.3 Kernels, Partial Derivatives and Regularization 

We start discussing how (partial) derivatives can be efficiently computed in RKHSs induced 
by smooth kernels and hence derive a new representer theorem. Practical computation of the 
derivatives for a differentiable functions is often performed via finite differences. For functions 
defined on a high dimensional space such a procedure becomes cumbersome and ultimately 
not-efficient. RKHSs provide an alternative computational scheme. 

Recall that the RKHS associated to a symmetric positive definite function k : X x X — > R is 
the unique Hilbert space (H, (-, such that k x = k(x, •) G %, for all x G X and 

f(x) = {f,k x ) H , (18) 

for all / G H,x G X. Property ( [T8] > is called reproducing property and k is called reproducing 
kernel (TJ. We recall a few basic facts. The functions in % can be written as pointwise limits 
of finite linear combinations of the type Y^.=i otik Xi , where a.{ G E, X4 G X for all i. One of the 
most important results for kernel methods, namely the representer theorem [57], shows that a 
large class of regularized kernel methods induce estimators that can be written as finite linear 
combinations of kernels centered at the training set points. In the following we will make use 
of the so called sampling operator, which returns the values of a function / G % at a set of input 
points x = (xi, . . . , x n ) 

S:H^M n , (Sf)i = (f,k Xi ), i = l,...,n. (19) 

The above operator is linear and bounded if the kernel is bounded- see Appendix |Aj which is 
true thanks to Assumption (Al). 

Next, we discuss how the theory of RKHS allows efficient derivative computations. Let 



(a.*),- Sfc(Jv) 



ds a 



(20) 



be the partial derivative of the kernel with respect to the first variable. Then, from Theorem 1 
in 1159 1 we have that, if k is at least aC 2 (X x X), (d a k) x belongs to % for all x G X and most 
importantly 

df(x) . . . . 

for a = l,...,d,x G X. It is useful to define the analogous of the sampling operator for 
derivatives, which returns the values of the partial derivative of a function / G % at a set of 
input points x = (x\,..., x n ), 

D a :H —?■ R n , (b a f)i = (f, (d a k) Xi ), (21) 

where a = 1, . . . , d, i = 1, . . . , n. It is also useful to define an empirical gradient operator 
V : H -> (R n ) d defined by V/ = (D a f)t=v The above operators are linear and bounded, since 
assumption [A2] is satisfied. We refer to Appendix [A] for further details and supplementary 
results. 

Provided with the above results we can prove a suitable generalization of the representer 
theorem. 
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Proposition. The minimizer of§7j can be written as 

n ^ n d j 

f T = Y ~ aik ^ + Y Y -P*A d a k hi 
i=l i=l a=l 

with aeRand (3 G M nd . 

The above result is proved in Appendix [A] and shows that the regularized solution is deter- 
mined by the set of n + nd coefficients a E W 1 and j3 E M. nd . We next discuss how such 
coefficients can be efficiently computed. 

Notation. In the following, given an operator A we denote by A* the corresponding adjoint 
operator. When A is a matrix we use the standard notation for the transpose A T = A* . 

4.4 Computing the Solution with Proximal Methods 

The functional £ T is not differentiate, hence its minimization cannot be done by simple gra- 
dient methods. Nonetheless it has a special structure that allows efficient computations using 
a forward-backward splitting algorithm [18|, belonging to the class of the so called proximal 
methods. 

Second order methods, see for example Ifl6| , could also be used to solve similar problems. 
These methods typically converge quadratically and allows accurate computations. However, 
they usually have a high cost per iteration and hence are not suitable for large scale problems, as 
opposed to first order methods having much lower cost per iteration. Furthermore, in the sem- 
inal paper by Nesterov ||45j first-order methods with optimal convergence rate are proposed 
|44). First order methods have since become a popular tool to solve non-smooth problems in 
machine learning as well as signal and image processing, see for example FISTA - [9] and ref- 
erences therein. These methods have proved to be fast and accurate IfTOl , both for i l -based 
regularization - see Ifl8| . (2T| , |30| , fl40| - and more general regularized learning methods - see 
for example |27|, ||43|, 1551 -■ 



Forward-backward splitting algorithms The functional £ T is the sum of the two terms F{-) = 
£(■) + Ti/\\-\\y_ and 2r0f\ The first term is strongly convex of modulus tv and differentiable, 
while the second term is convex but not differentiable. The minimization of this class of func- 
tionals can be done iteratively using the forward-backward (FB) splitting algorithm, 

f = prox^ f (/* - ^VF(f)) (22) 
f = c u f- x + c 2 ,f- 2 (23) 

where /° = f 1 G % is an arbitrary initialization, c\ : t, c 2 ,t are suitably chosen positive sequences, 
and proXr^D ■.'H^-'His the proximity operator [42] defined by, 

(7 1 

proXr^ D (/) = argmin ( -fif (g) + -||/-5|| 2 ) . 

The above approach decouples the contribution of the differentiable and not differentiable 
terms. Unlike other simpler penalties used in additive models, such as the l l norm in the 
lasso, in our setting the computation of the proximity operator of fif is not trivial and will 



12 



be discussed in the next paragraph. Here we briefly recall the main properties of the itera- 
tion ( [221 , ( |23| depending on the choice of c± t t, C2,t and a. The basic version of the algorithm 
[TBI, sometimes called ISTA (iterative shrinkage thresholding algorithm [9]), is obtained set- 
ting ci i = 1 and C2,t = for all t > 0, so that each step depends only on the previous iterate. 
The convergence of the algorithm for both the objective function values and the minimizers is 
extensively studied in ||T8ll , but a convergence rate is not provided. In O it is shown that the 
convergence of the objective function values is of order 0(1/ 1) provided that the step size a 
satisfies a > L, where L is the Lipschitz constant of VF/2. An alternative choice of c\j and 
C2,t leads to an accelerated version of the algorithm |22"] |, sometimes called FISTA (fast iterative 
shrinkage thresholding algorithm 115411911), which is obtained by setting sq = 1, 

st = \ U + ^l + 4 S ?_i) , c M = 1 + Sjz ^, and c 2 ,t = ^—^= ± - (24) 

The algorithm is analyzed in [9] and in |54| where it is proved that the objective values gener- 
ated by such a procedure have convergence of order 0(l/i 2 ), if the step size satisfies a > L. 

Computing the Lipscthitz constant L can be non trivial. Theorems 3.1 and 4.4 in [9| show 
that the iterative procedure ( [22) with an adaptive choice for the step size, called backtracking, 
which does not require the computation of L, shares the same rate of convergence of the cor- 
responding procedure with fixed step-size. Finally it is well known that, if the functional is 
strongly convex with a positive modulus, the convergence rate of both the basic and acceler- 
ated scheme is indeed linear for both the function values and the minimizers M45ll43ll46ll . 

In our setting we use FISTA to tackle the minimization of £ T but, as we mentioned before, 
we have to deal with the computation of the proximity operator associated to Of. 

Computing the proximity operator. Since Of is one-homogeneus, i.e. Of (A/) = AOf (/) for 
A > 0, the Moreau identity, see [18|, gives a useful alternative formulation for the proximity 
operator, that is 

prox^ D = I - TTz Cn , (25) 

where C n = (<90f )(0) is the subdifferential R of tt? at the origin, and 7rx Cn : U -> U is the 
projection on ^C n - which is well defined since C n is a closed convex subset of %. To describe 
how to practically compute such a projection, we start observing that the DENOVAS penalty 
Of is the sum of d norms in W 1 . Then following Section 3.2 in [43 J (see also [29 j) we have 

C n = dOf (0) = {/ G H | / = V*v with v G -B^l , 

where B% is the cartesian product of d unitary balls in W 1 , 

B* = B n x- - -x Bn = {v = (vi,...,v d ) \v a e R n , \\v a \\ n < 1, a = l,...,d}, 

s v y 

d times 

with B n defined in ( fl2] >. Then, by definition, the projection is given by 

^c„(/) = V^, 



4 Recall that the subdifferential of a convex functional Q : H — > R U {+00} is denoted with dQ(f) and is defined 
as the set 

dQ(f) :={h€H: Q(g) - Q(J) > (h,g - />„, Vg £ H}. 
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where 

v G argmin ||/ — V*i>||^. (26) 

Being a convex constrained problem, |26| can be seen as the sum of the smooth term ||/ — V*u||^ 
and the indicator function of the convex set B^. We can therefore use p2| |, again. In fact we can 
fix an arbitrary initialization v° G M. nd and consider, 

v« +1 = vr_ Bi U - - V(VV - f)) , (27) 

for a suitable choice of r?. In particular, we notice that ttt B d can be easily computed in closed 
form, and corresponds to the proximity operator associated to the indicator function of B%. 
Applying the results mentioned above, if rj > ||VV*||, convergence of the function values of 
problem |26| on the sequence generated via pT) is guaranteed. However, since we are inter- 
ested in the computation of the proximity operator, this is not enough. Thanks to the special 
structure of the minimization problem in |26| , it is possible to prove (see IIT9ll43l ) that 

\\V*v q - V*v\\ n -»■ 0, or, equivalently \\V*v q - irz Cn (f)\\n -> 0. (28) 

A similar first-order method to compute convergent approximations of V*v has been pro- 
posed in liTTJ . 



4.5 Overall Procedure and Convergence analysis 

To compute the minimizer of £ T we consider the combination of the accelerated FB-splitting 
algorithm (outer iteration) and the basic FB-splitting algorithm for computing the proximity 
operator (inner iteration). The overall procedure is given by 

* = ^(l + V 1 + 4s ?-i) 

jt = L + ftzl^l"j f-i + (29) 
\ s t J s t 

f = (l-^)f-^*(^-y)-W, 
for t = 2,3, . . . , where v l is computed through the iteration 

" 4 = 'iH " ^ " (» " t) ~ f ~ ¥" (V " y ))) ■ (30) 

for given initializations. 

The above algorithm is an inexact accelerated FB-splitting algorithm, in the sense that the 
proximal or backward step is computed only approximately. The above discussion on the con- 
vergence of FB-splitting algorithms was limited to the case where computation of the proximity 
operator is done exactly (we refer to this case as the exact case). The convergence of the inex- 
act FB-splitting algorithm does not follow from this analysis. For the basic - not accelerated - 
FB-splitting algorithm, convergence in the inexact case is still guaranteed (without a rate) |18], 
if the computation of the proximity operator is sufficiently accurate. The convergence of the 
inexact accelerated FB-splitting algorithm is studied in [56 1 where it is shown that the same 
convergence rate of the exact case can be achieved, again provided that the accuracy in the 
computation of the proximity operator can be suitably controlled. Such a result can be adapted 
to our setting to prove the following theorem, as shown in Appendix IB] 
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Theorem 1. Let e* ~ t~ l with I > 3/2, a > \\S*S\\ + tv, rj > ||VV*||, and f given by ((29) orif/i 
computed through pO) . De/i'ne = (l — ^) /* - ^S 1 * ^S 1 /* - y). If u 1 = u 9 ,/or g sue?/ that the 
following condition is satisfied 

~n?(f)-2(V*v«J t )<e 2t , (31) 
a 

Then there exists a constant C > such that 

£ T (f)-£ T (f T )<^ 



and thus, ifv>0, 



As for the exact accelerated FB-splitting algorithm, the step size of the outer iteration has 
to be greater than or equal to L = |p*5|| + ru. In particular, we choose a = \\S*S\\ + tv and, 
similarly, r] = ||VV*||. 

We add few remarks. First, as it is evident from |32") , the choice of v > allows to obtain 
convergence of /* to f T with respect to the norm in %, and positively influences the rate of 
convergence. This is a crucial property in variable selection, where it is necessary to accurately 
estimate the minimizer of the expected risk fj, and not only its minimum £(fp). Second, con- 
dition f3l| | represents an implementable stopping criterion for the inner iteration, once that the 



representer theorem is proved. Further comments on the stopping rule are given in Section 4.6 
Third, we remark that for proving convergence of the inexact procedure, it is essential that the 
specific algorithm proposed to compute the proximal step generates a sequence belonging to 
C n and satisfying ((28). 



4.6 Further Algorithmic Considerations 

We conclude discussing several practical aspects of the proposed method. 

The finite dimensional implementation. We start by showing how the representer theorem 
can be used, together with the iterations described by ( (29) and ( |30) , to derive Algorithm]!] This 
is summarized in the following proposition. 

Proposition. For v > and f = I £. a°k Xl + i £ a P^dak)^ for any a G M n , /3° G W nd , 
the solution at step tfor the updating rule |29) is given by 



1 n 1 n d 

f l = - E ^ + - E E pUw* ( 33 ) 



n <■ — ' n 

i=l i=l a=l 



with a* and (3* defined by the updating rules ( I5p4 >, where v l in (17 1 can be estimated, starting from 
any v° G M. nd , and using the iterative rule ( fl~6") . 

The proof of the above proposition can be found in Appendix |BJ and is based on the obser- 
vation that K, Z a , Z, L a defined at the beginning of this Section are the matrices associated to 
the operators SS* : R n -> M n , SD* a : R n -> R n , SV* : R nd -> R n and D a V* : R nd -> M n , respec- 
tively. Using the same reasoning we can make the following two further observations. First, 
one can compute the step sizes a and rj as a = ||K|| + tv, and 77 = ||L||. Second, since in practice 
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we have to define suitable stopping rules, Equations f32| | and (28 1 suggest the following choices 

El 

II/' - f-'Wn < e (ext) and — «f (/*) - 2(VV,/ 4 ) < . 

a 

As a direct consequence of ( |33| and using the definition of matrices K, Z, L, these quantities can 
be easily computed as 



a 



Hf-f-% = (5a,K5a) n + 2{5a,Z5f3) n + {5f3,L5f3) 1 
d 

n?(f)-2(V*v«J t ) = ^\\Z a a t + L a p t \\-2{a t ,Zv'') n . 



a=l 



where we defined 5a = a* — a* -1 and 5f3 = (3 1 — /3* _1 . Also note that, according to Theorem [lj 
e (int) mus t depend on the outer iteration as = e* ~ i~ 2/ , Z > 3/2. 

Finally we discuss a criterion for identifying the variables selected by the algorithm. 

Selection. Note that in the linear case f(x) = j3 ■ x the coefficients /3 1 , . . . , (3 d coincide with the 
partial derivatives, and the coefficient vector (3 given by I 1 regularization is sparse (in the sense 
that it has zero entries), so that it is easy to detect which variables are to be considered relevant. 
For a general non-linear function, we then expect the vector (||-Da/|| n )a=i °f tne norms of the 
partial derivatives evaluated on the training set points, to be sparse as well. In practice since 
the projection -ir T / aB d is computed only approximately, the norms of the partial derivatives will 
be small but typically not zero. The following proposition elaborates on this point. 

Proposition. Let v = (v a ) d =1 G B d such that, for any a > 

v*v = -\v(£{h+rv\\rt n ), 

then 

K||n < " llAJln = 0. (34) 

a 

Moreover, ifv 1 is given by Algorithm^with the inner iteration stopped when the assumptions of The- 
orem [l] are met, then there exists e* > (precisely defined in I40) ) depending on the tolerance e* 
used in the inner iteration and satisfying lim^o^ 4 = 0, such that if m := min{||l) a / T || n : a e 

{l,...,d}s.t.||AJln >0}. 

Kl|n>-- V L ^P«/ T |U = 0- (35) 
a 2m 



The above result, whose proof can be found in Appendix |BJ is a direct consequence of the 
Euler equation for £ T and of the characterization of the subdifferential of The second 
part of the proof follows by observing that, as V*v belongs to the subdifferential of Slf at 
f T ,V*v t belongs to the approximate subdifferential of fif at f T , where the approximation of the 
subdifferential is controlled by the precision used in evaluating the projection. Given the pair 
(/*, v l ) evaluated via Algorithm]!} we can thus consider to be irrelevant the variables such that 
ll^alU < t/u — (e 1 ) 2 /{2m). Note that the explicit form of e* is given in |40|). 



5 In practice we often use a stopping rule where the tolerance is scaled with the current iterate, || /' ' ' : 

e^WfWn and ^fif (/*) - 2(V*^,/ i ) < e™\\ VV||«. 
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5 Consistency for Learning and Variable Selection 



In this section we study the consistency properties of our method. 



5.1 Consistency 

As we discussed in Section 3.1 though in practice we consider the regularizer Of defined 
in (|4), ideally we would be interested into Of (/) = Ylt=i \\Daf\\px' / G The following 
preliminary result shows that indeed Of is a consistent estimator of Of when considering 
functions in % having uniformly bounded norm. 

Theorem 2. Let r < oo, then under assumption {AT) 

lim P( sup inf (/) > e ) = Ve>0. 

n ^°° \\\S\\u<r J 

The restriction to functions such that \\f\\n < r is natural and is required since the penalty 
Of forces the partial derivatives to be zero only on the training set points. To guarantee that 
a partial derivative, which is zero on the training set, is also close to zero on the rest of the 
input space, we must control the smoothness of the function class where the derivatives are 
computed. This motivates constraining the function class by adding the (squared) norm in T-L 
into d5). This is in the same spirit of the manifold regularization proposed in [12J. 

The above result on the consistency of the derivative based regularizer is at the basis of the 
following consistency result. 

Theorem 3. Under assumptions Al, A2 and A3, recalling that £(f) = J(y — f(x)) 2 dp(x, y), 



lim P £(f n ) - inf £(f) > e = Ve > 0, 
?woo y /en J 

for any r n satisfying 

Tn^O {VnTnY 1 -)• 0. 

The proof is given in the appendix and is based on a sample/ approximation error decom- 
position 

£(h - inf £{f) < \£(fT)- £ T {r) \ + \ £ T {n _ i n f 

j£n ^ y ' /Sri 

sample error 



where 



approximation error 



£ T (f) := £(f) + 2r«?(/) + ru\\f\\^ f := argmin^. 

H 



The control of both terms allows to find a suitable parameter choice which gives consistency. 
When estimating the sample error one has typically to control only the deviation of the empir- 
ical risk from its continuos counterpart. Here we need Theorem [2] to also control the deviation 
of Of from Of. Note that, if the kernel is universal HBTl , then inf £(f) = £(f p ) and Theorem 
[3] gives the universal consistency of the estimator f Tn . 

To study the selection properties of the estimator f Tn - see next section- it useful to study 
the distance of f Tn to f p in the %-norm. Since in general f p might not belong to %, for the sake 
of generality here we compare f Tn to a minimizer of inf £ (/) which we always assume to 



17 



exist. Since the minimizers might be more then one we further consider a suitable minimal 
norm minimizer fj,- see below. More precisely given the set 

T H :={feH\S(f)= inf £(/)} 

(which we assume to be not empty), we define 

/t:=argmm{n?(/) + i ,||/|&}. 

Note that fj is well defined and unique, since £7f (-) + u\\ ■ ||^ is strongly convex and £ is convex 
and lower semi-continuous on T-L, which implies that is closed and convex in T-L. Then, we 
have the following result. 

Theorem 4. Under assumptions Al, A2 and A3, we have 

lim P(\\h-fl\\H>*)=0, Ve>0, 



for any r n such that r n — > and (\A* r n) 1 — > 0. 

The proof, given in Appendix |c| is based on the decomposition in sample error, ||/ T — f T \\u, 
and approximation error, \\f T — fj>\\n- To bound the sample error we use recent results [55 j that 
exploit Attouch-Wets convergence HHHE! and coercivity of the penalty (ensured by the RKHS 
norm) to control the distance between the minimizers f T , f T by the distance the minima £ T (f T ) 
and £ T (f T ). Convergence of the approximation error is again guaranteed by standard results 
in regularization theory \26\. We underline that our result is an asymptotic one, although it 



would be interesting to get an explicit learning rate, as we discuss in Section 5.3 



5.2 Selection properties 

We next consider the selection properties of our method. Following Equation j3|, we start by 
giving the definition of relevant /irrelevant variables and sparsity in our context. 

Definition 1. We say that a variable a = 1, . . . , d is irrelevant with respect to pfor a differentiable 
function /, if the corresponding partial derivative D a f is zero px-almost everywhere, and relevant 
otherwise. In other words the set of relevant variables is 

R f :={a€{l,...,d}|||D o /||^>0}. 

We say that a differentiable function f is sparse ifVL®(f) := \Rf \ < d. 

The goal of variable selection is to correctly estimate the set of relevant variables, R p := Ra. 
In the following we study how this can be achieved by the empirical set of relevant variables, 
R T ", defined as 

fT" := {aG{l,...,d}||| J D a r"||„>0}. 
Theorem 5. Under assumptions Al, A2 and A3 

lim pf R p C R T A = 1 

for any r n satisfying 
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The above result shows that the proposed regularization scheme is a safe filter for variable 
selection, since it does not discard relevant variables, in fact, for a sufficiently large number 
of training samples, the set of truly relevant variables, R p , is contained with high probability 
in the set of relevant variables identified by the algorithm, R Tn . The proof of the converse 
inclusion, giving consistency for variable selection (sometimes called sparsistency), requires 
further analysis that we postpone to a future work (see the discussion in the next subsection). 

5.3 Learning Rates and Sparsity 

The analysis in the previous sections is asymptotic, so it is natural to ask about the finite sam- 
ple behavior of the proposed method, and in particular about the implication of the sparsity 
assumption. Indeed, for a variety of additive models it is possible to prove that the sample 
complexity (the number of samples needed to achieve a given error with a specified probabil- 
ity) depends linearly on the sparsity level and in a much milder way to the total number of 
variables, e.g. logarithmically I1T4I . Proving similar results in our setting is considerably more 
complex and in this section we discuss the main sources of possible challenges. 

Towards this end, it is interesting to contrast the form of our regularizer to that of structured 
sparsity penalties for which sparsity results can be derived. Inspecting the proof in Appendix 
|c| one can see that it possible to define a suitable family of operators Vj,Vj : % — > %, with 
j = 1 , . . . , d, such that 

d d 

«?(/) = £ W\\u, = £ H^i/Htt- 06) 

i=i j=i 

The operators (VJ-)j are positive and self-adjoint and so are the operators (Vj)j- The latter can 
be shown to be stochastic approximation of the operators (Vj)j, in the sense that the equalities 
Eft/ 2 ] = Vf hold true for all j = 1, . . . , d. 

It is interesting to compare the above expression to the one for the group lasso penalty, 
where for a given linear model, the coefficients are assumed to be divided in groups, only few of 
which are predictive. More precisely, given a collection of groups of indices Q = {G±, . . . , G r }, 
which forms a partition of the set {1, ... , p}, and a linear model f(x) = (/3, x)rp, the group lasso 
penalty is obtained by considering 

r 

^(/3)=EH/Hll R l°7b 
7=1 

where, for each 7, /3\q is the |G 7 | dimensional vector obtained restricting a vector (3 to the 
indices in G 7 . If we let P 7 be the orthogonal projection on the subspace of M rf corresponding 
G 7 -th group of indices, we have that (P 7 /3, P 7 /3') = for all 7,7' £ T and /?, f3' e MP, since the 
groups form a partition of {1, ... ,p}. Then it is possible to rewrite the group lasso penalty as 

r 

n^) = J2\\P 7 P\\^. 

7=1 

The above idea can be extended to an infinite dimensional setting to obtain multiple kernel 
learning (MKL). Let % be a (reproducing kernel) Hilbert space which is the sum of T disjoint 
(reproducing kernel) Hilbert spaces (% 7 , IH^Ker/ and P 7 : % — > % 7 the projections of % onto 
% 7 , then MKL is induced by the penalty 

Q MKL {f) = ||p 7 /|| 7 . 

7er 
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When compared to our derivative based penalty, see ( 36 I, one can notice at least two source 
of difficulties: 



1. the operators (Vj)j are not projections and no simple relation exists among their ranges, 

2. in practice we have only access to the empirical estimates (Vj)j. 

Indeed, structured sparsity model induced by more complex index sets have been consid- 
ered, see for example 1351 , but the penalties are still induced by operators which are orthogo- 
nal projections. Interestingly, a class of penalties induced by a (possibly countable) family of 
bounded operators V = {V 7 } ye r- not necessarily projections- has been considered in [41J. This 
class of penalties can be written as 

n W (/) = mf{^ ||/ 7 || | / 7 € ft,J> 7 / 7 = /}. 

7 er 7 er 

It is easy to see that the above penalty does not include the regularizer |36| as a special case. 

In conclusion, rewriting our derivative based regularizer as in |36| highlights similarity and 
differences with respect to previously studied sparsity methods: indeed many of these methods 
are induced by families of operators. On the other hand, typically, the operators are assumed 
to satisfy stringent assumptions which do not hold true in our case. Moreover in our case one 
would have to overcome the difficulties arising from the random estimation of the operators. 
These interesting questions are outside of the scope of this paper, will be the subject of future 
work. 



6 Empirical Analysis 

The content of this section is divided into three parts. First, we describe the choice of tuning 
parameters. Second, we study the properties of the proposed method on simulated data un- 
der different parameter settings, and third, we compare our method to related regularization 
methods for learning and variable selection. 

When we refer to our method we always consider a two-step procedure based on variable 
selection via Algorithm[l]and regression on the selected variable via (kernel) Regularized Least 
Squares (RLS). The kernel used in both steps is the same. Where possible, we applied the same 
reweighting procedure to the methods we compared with. 



6.1 Choice of tuning parameters 

When using AlgorithmJTJ once the parameters n and v are fixed, we evaluate the optimal value 
of the regularization parameter r via hold out validation on an independent validation set of 
rival = n samples. The choice of the parameter v, and its influence on the estimator is discussed 
in the next section. 

Since we use an iterative procedure to compute the solution, the output of our algorithm 



will not be sparse in general and a selection criterion is needed. In Subsection 4.6 we discussed 
a principled way to select variable using the norm of the coefficients {v l a )„ =1 . 

When using MKL, t 1 regularization, and RLS we used hold out validation to set the regular- 
ization parameters, while for COSSO and HKL we used the choices suggested by the authors. 
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6.2 Analysis of Our Method 

6.2.1 Role of the smoothness enforcing penalty v\\ • [||> 

From a theoretical stand point we have shown that v has to be nonzero, in order for the pro- 
posed regularization problem (|5) to be well-posed. We also mentioned that the combination 
of the two penalties f)f and ensures that the regularized solution will not depend on 
variables that are irrelevant for two different reasons. The first is irrelevance with respect to 
the output. The second type of irrelevance is meant in an unsupervised sense. This happens 
when one or more variables are (approximately) constant with respect to the marginal distri- 
bution px, so that the support of the marginal distribution is (approximately) contained in a 
coordinate subspace. Here we present two experiments aimed at empirically assessing the role 
of the smoothness enforcing penalty ||-||^ and of the parameter v. We first present an experi- 
ment where the support of the marginal distribution approximately coincides with a coordinate 
subspace x 2 = 0. Then we systematically investigate the stabilizing effect of the smoothness 
enforcing penalty also when the marginal distribution is not degenerate. 

Adaption to the Marginal Distribution We consider a toy problem in 2 dimensions, where 
the support of the marginal distribution px{x l ,x 2 ) approximately coincides with the coordi- 
nate subspace x 2 = 0. Precisely x 1 is uniformly sampled from [—1, 1], whereas x 2 is drawn 
from a normal distribution x 2 ~ Af(0, 0.05). The output labels are drawn from y = (x 1 ) 2 + w, 
where w is a white noise, sampled from a normal distribution with zero mean and variance 0.1. 
Given a training set of n = 20 samples i.i.d. drawn from the above distribution (Figure [^top- 
left), we evaluate the optimal value of the regularization parameter r via hold out validation 
on an independent validation set of n va i = n = 20 samples. We repeat the process for v = 
and v = 10. In both cases the reconstruction accuracy on the support of px is high, see Figure 
[3]bottom-right . However, while v = 10 our method correctly selects the only relevant variable 
x 1 (Figure plbottom-left), when v = both variables are selected (Figure |3]bottom-center), since 
functional £ T '° is insensible to errors out of supp(px), and the regularization term rfif alone 
does not penalizes variations out of supp(p^) . 

Effect of varying v Here we empirically investigate the stabilizing effect of the smoothness 
enforcing penalty when the marginal distribution is not degenerate. The input variables x = 
(x 1 ,...,^ 20 ) are uniformly drawn from [— 1,1] 20 . The output labels are i.i.d. drawn from 
y = A Yl a =i J2t=a+i xClxb + w ' where w ~ AA(0, 1), and A is a rescaling factor that determines 
the signal to noise ratio to be 15:1. We extract training sets of size n which varies from 40 to 
120 with steps of 10. We then apply our method with polynomial kernel of degree p = 4, 
letting vary v in {0.1, 0.2, 0.5, 1, 2, 5, 10, 20}. For fixed n and v we evaluate the optimal value 
of the regularization parameter r via hold out validation on an independent validation set of 
n vai = n samples. We measure the selection error as the mean of the false negative rate (fraction 
of relevant variables that were not selected) and false positive rate (fraction of irrelevant vari- 
ables that were selected). Then, we evaluate the prediction error as the root mean square error 
(RMSE) error of the selected model on an independent test set of n tes t = 500 samples. Finally 
we average over 50 repetitions. 

In Figure|4]we display the prediction error, selection error, and computing time, versus n for 
different values of v. Clearly, if v is too small, both prediction and selection are poor. For v > 1 
the algorithm is quite stable with respect to small variations of v. However, excessive increase 
of the smoothness parameter leads to a decrease in prediction and selection performance. In 
terms of computing time, the higher the smoothness parameter the better the performance. 
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Figure 3: Effect of the combined regularization Uf (•) + v \ \ ■ [| 2, on a toy problem in M. 2 where the 
support of marginal distribution approximately coincides with the coordinate subspace x 2 = 0. 
The output labels are drawn from y = (x 1 ) 2 + w, with w ~ W(0, 0.1). 



training points 



0.75 - 




Figure 4: Selection error (left), prediction error (center), and computing time (right) versus n 
for different values of v. The points correspond to the mean over the repetitions. The dotted 
line represents the white noise standard deviation. In the left figure the curves corresponding 
to v = 5, v = 10, and v = 20 are overlapping. 
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6.2.2 Varying the model's parameters 

We present 3 sets of experiments where we evaluated the performance of our method (DEN- 
OVAS) when varying part of the inputs parameters and leaving the others unchanged. The 
parameters we take into account are the following 

• n, training set size 

• d, input space dimensionality 

• | R p \, number of relevant variables 

• p, size of the hypotheses space, measured as the degree of the polynomial kernel. 

In all the following experiments the input variables x = (x 1 , . . . , x d ) are uniformly drawn 
from [—1, l] d . The output labels are computed using a noise-corrupted regression function / 
that depends nonlinearly from a set of the input variables, i.e. y = Xf(x) + w, where w is a 
white noise, sampled from a normal distribution with zero mean and variance 1, and A is a 
rescaling factor that determines the signal to noise ratio. For fixed n, d, and \R P \ we evaluate 
the optimal value of the regularization parameter r via hold out validation on an independent 
validation set of n va j = n samples. 

Varying n, d, and \R P \ In this experiment we want to empirically evaluate the effect of the 
input space dimensionality, d, and the number of relevant variables, \R P \, when the other pa- 
rameters are left unchanged. In particular we use d = 10, 20, 30, 40 and \R p \ = 2, 3, 4, 5, 6. For 

each value of \R P \ we use a different regression function, f(x) = XY^}a=i Xl=a+i c abx a x b , so 
that for fixed \R P \ all 2-way interaction terms are present, and the polynomial degree of the 
regression function is always 2. The coefficients c a b are randomly drawn from [.5, 1] And A is 
determined in order to set the signal to noise ratio as 15:1. We then apply our method with 
polynomial kernel of degree p = 2. The regression function thus always belongs to the hy- 
potheses space. 

In Figure |5j we display the selection error, and the prediction error, respectively, versus n for 
different values of d and number of relevant variables | R p \ . Both errors decrease with n and in- 
crease with d and \R P \. In order to better visualize the dependance of the selection performance 
with respect to d and \R P \, in Figure [6] we plotted the minimum number of input points that are 
necessary in order to achieve 10% of average selection error. It is clear by visual inspection that 
\R P \ has a higher influence than d on the selection performance of our method. 

Varying n and p In this experiment we want to empirically evaluate the effect of the size 
of the hypotheses space on the performance of our method. We therefore leave unchanged 
the data generation setting, made exception for the number of training samples, and vary the 
polynomial kernel degree as p = 1,2,3,4, 5, 6. We let d = 20, R p = {1, 2}, and f(x) = x 1 x 2 , and 
let vary n from 20 to 80 with steps of 5. The signal to noise ratio is 3:1. 

In Figure |7j we display the prediction and selection error, versus n, for different values of p. For 
p > 2, that is when the hypotheses space contains the regression function, both errors decrease 
with n and increase with p. Nevertheless the effect of p decreases for large p, in fact for p = 4, 5, 
and 6, the performance is almost the same. On the other hand, when the hypotheses space 
is too small to include the regression function, as for the set of linear functions (p = 1), the 
selection error coincides with chance (0.5), and the prediction error is very high, even for large 
numbers of samples. 
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Figure 5: Prediction error (top) and selection error (bottom) versus n for different values of d 
and number of relevant variables (|-R P |). The points correspond to the means over the repeti- 
tions. The dotted line represents the white noise standard deviation. 



$ 10 





, in 1=2 
* e 
IR 1=3 

IR 1=4 


Ik 


- IR 1=5 

e 

IR 1=6 





80 100 120 





Figure 6: Minimum number of input points (n) necessary to achieve 10% of average selection 
error versus the number of relevant variables | R p | for different values of d (left), and versus d 
for different values of \R p \ (right). 




Varying the number of relevant features, for fixed \ R P \: comparison with £ regularization on 
the feature space In this experiment we want to empirically evaluate the effect of the number 
of features involved in the regression function ( that is the number of monomials constituting 
the polynomial) on the performance of our method when | R p | remains the same as well as all 
other input parameters. Note that while \R P \ is the number of relevant variables, here we vary 
the number of relevant features (not variables!), which, in theory, has nothing to do with \R P \. 
Furthermore we compare the performance of our method to that of I 1 regularization on the 
feature space (i 1 -features). We therefore leave unchanged the data generation setting, made 
exception for the regression function. We set d = 10, R p = {1, 2, 3}, n = 30, and then use a 
polynomial kernel of degree 2. The signal to noise ratio is this time 3:1. Note that with this 
setting the size of the features space is 66. For fixed number of relevant features the regression 
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Figure 7: Prediction error (left) and selection error (right) versus n for different values of the 
polynomial kernel degree (p). The points correspond to the means over the repetitions. The 
dotted line represents the white noise standard deviation. 




Figure 8: Prediction error (left) selection error (right) versus the number of relevant features. 
The points correspond to the means over the repetitions. The dotted line represents the white 
noise standard deviation. 




number of relevant features number relevant features 



function is set to be a randomly chosen linear combination of the features involving one or two 
of the first three variables (x 1 , (x 1 ) 2 , x l x 2 , x 1 ^ 3 , etc.), with the constraint that the combination 
must be a polynomial of degree 2, involving all 3 variables. 

In Figure [8j we display the prediction and selection error, versus the number of relevant fea- 
tures. While the performance of ^-features fades when the number of relevant features in- 
creases, our method presents stable performance both in terms of selection and prediction er- 
ror. From our simulation it appears that, while our method depends on the number of relevant 
variables, it is indeed independent of the number of features. 

6.3 Comparison with Other Methods 

In this section we present numerical experiments aimed at comparing our method with state- 
of-the-art algorithms. In particular, since our method is a regularization method, we focus on 
alternative regularization approaches to the problem of nonlinear variable selection. For com- 
parisons with more general techniques for nonlinear variable selection we refer the interested 
reader to 0. 
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6.3.1 Compared algorithms 

We consider the following regularization algorithms: 

• Additive models with multiple kernels, that is Multiple Kernel Learning (MKL) 

• I 1 regularization on the feature space associated to a polynomial kernel (^-features) 

• COSSO EHJ with 1-way interaction (COSSOl) and 2-way interactions (COSSO2)0 

• Hierarchical Kernel Learning [7] with polynomial (HKL pol.) and hermite (HKL herm.) 
kernel 

• Regularized Least Squares (RLS). 

Note that, differently from the first 4 methods, RLS is not a variable selection algorithm, how- 
ever we consider it since it is typically a good benchmark for the prediction error. 

For ^-features and MKL we use our own Matlab implementation based on proximal meth- 
ods (for details see 031). For COSSO we used the Matlab code available at |www . stat . wise . | 
ledu/ -yilinl or Iwww4 . stat . ncsu . edu/ ~hzhang| which can deal with 1 and 2-way inter- 
actions. For HKL we used the code available online at h ttp : / /www, di . ens . f r/~fbach/| 
hkl/ index . html] While for MKL and ^-features we are able to identify the set of selected 
variables, for COSSO and HKL extracting the sparsity patterns from the available code is not 
straightforward. We therefore compute the selection errors only for ^-features, MKL, and our 
method . 

6.3.2 Synthetic data 

We simulated data with d input variables, where each variable is uniformly sampled from [- 
2,2]. The output y is a nonlinear function of the first 4 variables, y = f(x , x 2 , x 3 ,x 4 ) + e, where 
epsilon is a white noise, e ~ AA(0, a 2 ), and a is chosen so that the signal to noise ratio is 15:1. 
We consider the 4 models described in Table Q] 



Table 1: Synthetic data design 



number of 
relevant variables 



n 



model (/) 



additive p=2 
2way p=2 
3way p=6 
radial 



40 
40 
40 
20 



4 
4 
3 
2 



100 
100 
100 

100 y 



a\2 



b=a+l 



XX 



y r 



(x 1 x 2 x 3 ) 2 
(x 2 ) 2 )e 



-((x^+fx 8 ) 2 ) 



For model selection and testing we follow the same protocol described at the beginning of 
Section|6j with n = 100, 100 and 1000 for training, validation and testing, respectively. Finally 
we average over 20 repetitions. In the first 3 models, for MKL, HKL, RLS and our method we 

6 In all toy data, and in part of the real data, the following warning message was displayed: 

Maximum number of iterations exceeded; increase options . Maxlter . 

To continue solving the problem with the current solution as the starting point, 
set xO = x before calling lsqlin. 

In those cases the algorithm did not reach convergence in a reasonable amount of time, therefore the error bars 
corresponding to COSS02 were omitted. 
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employed the polynomial kernel of degree p, where p is the polynomial degree of the regression 
function /. For ^-features we used the polynomial kernel with degree chosen as the minimum 
between the polynomial degree of / and 3. This was due to computational reasons, in fact, 
with p = 4 and d = 40, the number of features is highly above 100, 000. For the last model, we 
used the polynomial kernel of degree 4 for MKL, ^-features and HKL, and the Gaussian kernel 
with kernel parameter a = 2 for RLS and our method^] COSS02 never reached convergence. 
Results in terms of prediction and selection errors are reported in Figure [9j 

Figure 9: Prediction error (top) and fraction of selected variables (bottom) on synthetic data 
for the proposed method (DENOVAS), multiple kernel learning for additive models (MKL), i 1 
regularization on the feature space associated to a polynomial kernel {l l -features), COSSO with 
1-way interactions (COSSOl), hierarchical kernel learning with polynomial kernel (HKL pol.), 
and regularized least squares (RLS). The dotted line in the upper plot corresponds to the white 
noise standard deviation. Selection errors for COSSO, and HKL are not reported because they 
are not straightforwardly computable from the available code. 
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When the regression function is simple (low interaction degree or low polynomial degree) 
more tailored algorithms, such as MKL-which is additive by design- for the experiment "ad- 
ditive p=2", or ^-features for experiments "2way p=4" - in this case the dictionary size is less 
than 1000-, compare favorably with respect to our method. However, when the nonlinearity 
of the regression function favors the use of a large hypotheses space, our method significantly 
outperforms the other methods. This is particularly evident in the experiment "radial", which 
was anticipated in Section [3} where we plotted in Figure [2] the regression function and its esti- 
mates obtained with the different algorithms for one of the 20 repetitions. 



7 Note that here we are interested in evaluating the ability of our method of dealing with a general kernel like 
the Gaussian kernel, not in the choice of the kernel parameter itself. Nonetheless, a data driven choice for a will be 



presented in the real data experiments in Subsection 6.3.3 
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6.3.3 Real data 

We consider the 7 benchmark data sets described in Table |2j We build training and validation 



Table 2: Real data sets 





number of 


number of 






data name 


input variables 


instances 


source 


task 


boston housing 


13 


506 


LIACCQ 


regression 


census 


16 


22784 


LIACC 


regression 


delta ailerons 


5 


7129 


LIACC 


regression 


stock 


10 


950 


LIACC 


regression 


image segmentation 


18 


2310 


IDA0 


classification 


pole telecomm 


260 


15000 


LIACC 


regression 


breast cancer 


32 


198 


UClQ 


regression 



sets by randomly drawing n tra in and n ya \ samples, and using the remaining samples for testing. 
For the first 6 data sets we let n tra in = n va \ = 150, whereas for breast cancer data we let n tr ain = 



n vai = 60. We then apply the algorithms described in Subsection 6.3.1 with the validation 



protocol described in Section [6] For our method and RLS we used the gaussian kernel with 
the kernel parameter a chosen as the mean over the samples of the euclidean distance form 
the 20-th nearest neighbor. Since the other methods cannot be run with the gaussian kernel we 
used a polynomial kernel of degree p = 3 for MKL and ^-features. For HKL we used both the 
polynomial kernel and the hermite kernel, both with p = 3. Results in terms of prediction and 



selection error are reported in Figure 10 



Some of the data, such as the stock data, seem not to be variable selection problem, in fact 
the best performance is achieved by our method though selecting (almost) all variables, or, 
equivalently by RLS. Our method outperforms all other methods on several data sets. In most 
cases, the performance of our method and RLS are similar. Nonetheless our method brings 
higher interpretability since it is able to select a smaller subset of relevant variable, while the 
estimate provided by RLS depends on all variables. 

We also run experiments on the same 7 data sets with different kernel choices for our 
method . We consider the polynomial kernel with degree p = 2, 3 and 4, and the gaussian 
kernel. Comparisons among the different kernels in terms of prediction and selection accu- 



racy are plotted in Figure 11 Interestingly the choice of the gaussian kernel seems to be the 



preferable choice in most data sets. 



7 Discussion 

Sparsity based method has recently emerged as way to perform learning and variable selec- 
tion from high dimensional data. So far, compared to other machine learning techniques, this 
class of methods suffers from strong modeling assumptions and is in fact limited to paramet- 
ric or semi-parametric models (additive models). In this paper we discuss a possible way to 
circumvent this shortcoming and exploit sparsity ideas in a non-parametric context. 

We propose to use partial derivatives of functions in a RKHS to design a new sparsity 
penalty and a corresponding regularization scheme. Using results from the theory of RKHS 
and proximal methods we show that the regularized estimator can be provably computed 
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Figure 10: Prediction error (top) and fraction of selected variables (center) and computing time (bottom) 
on real data for the proposed method (DENOVAS), multiple kernel learning for univariate additive 
functions (MKL), I 1 regularization on the feature space associated to a polynomial kernel (^ 1 -features), 
COSSO with 1-way interactions (COSSOI), hierarchical kernel learning with polynomial kernel (HKL 
pol.), hierarchical kernel learning with hermite kernel (HKL herm.) and regularized least squares (RLS). 
Prediction errors for COSS02 are not reported because it is always outperformed by COSSOI. such 
errors were still too large to report in the first three data sets, and were not available since the algorithm 
did not reach convergence for image segmentation, pole telecomm and breast cancer data. To make 
the prediction errors comparable among experiments, root mean squared errors (RMSE) were divided 
by the outputs standard deviation, which corresponds to the dotted line. Error bars are the standard 
deviations of the normalized RMSE. Though the largest normalized RMSE appear out of the figure 
axis, we preferred to display the prediction errors with the current axes limits in order to allow the 
reader to appreciate the difference between the smallest, and thus most significant, errors. Selection 
errors for COSSO, and HKL are not reported because they are not straightforwardly computable from 
the available code. The computing time corresponds to the entire model selection and testing protocol. 
Computing time for RLS is not reported because it was always negligible with respect to the other 
methods. 
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Figure 11: Prediction error (top) and fraction of selected variables (bottom) on real data for our 
method with different kernels: polynomial kernel of degree p = 1 , 2 and 3 (DENOVAS pol-p), 
and Guassian kernel (DENOVAS gauss). Error bars represent the standard deviations. In order 
to make the prediction errors comparable among experiments, root mean square errors were 
divided by the outputs standard deviation, which corresponds to the dotted line. 



through an iterative procedure. The consistency property of the proposed estimator are stud- 
ied. Exploiting the non-parametric nature of the method we can prove universal consistency. 
Moreover we study selection properties and show that that the proposed regularization scheme 
represents a safe filter for variable selection, as it does not discard relevant variables. Exten- 
sive simulations on synthetic data demonstrate the prediction and selection properties of the 
proposed algorithm. Finally, comparisons to state-of-the-art algorithms for nonlinear variable 
selection on toy data as well as on a cohort of benchmark data sets, show that our approach 
often leads to better prediction and selection performance. 

Our work can be considered as a first step towards understanding the role of sparsity be- 
yond additive models. It substantially differs with respect to previous approaches based on 
local polinomial regression |37l [131 |Miller an d Hall(2010) |, since it is a regularization scheme 
directly performing global variable selection. The RKHSs' machinery allows on the one hand 
to find a computationally efficient algorithmic solution, and on the other hand to consider very 
general probability distributions p, which are not required to have a positive density with re- 
spect to the Lebesgue measure (differently from [20]). Several research directions are yet to be 
explored. 

• From a theoretical point of view it would be interesting to further analyzing the sparsity 
property of the obtained estimator in terms of finite sample estimates for the prediction 
and the selection error. 



30 



• From a computational point of view the main question is whether our method can be 
scaled to work in very high dimensions. Current computations are limited by memory 
constraints. A variety of method for large scale optimization can be considered towards 
this end. 

• A natural by product of computational improvements would be the possibility of con- 
sidering a semi-supervised setting which is naturally suggested by our approach. More 
generally we plan to investigate the application of the RKHS representation for differen- 
tial operators in unsupervised learning. 

• More generally our study begs the question of whether there are alternative /better ways 
to perform learning and variable selection beyond additive models and using non para- 
metric models. 
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A Derivatives in RKHS and Representer Theorem 

Consider L 2 (X,p x ) = {/ : X — > R measurable| / \f(x)\ 2 dp x (x) < 00} and R n with inner product 
normalized by a factor 1 jn, \ || n . 

The operator If. : H —> L 2 (X,p x ) defined by (If.f)(x) — (f,k x )n, for almost all x G X, is well- 
defined and bounded thanks to assumption Al. The sampling operator |i"9] | can be seen as its empirical 
counterpart. Similarly D a : % -> L 2 (X,p x ) defined by (D a f)(x) = (/, (d a k) x ), for almost all x e X 
and a — 1, ... d, is well-defined and bounded thanks to assumption A2. The operator | [2~l) can be seen 
as its empirical counterpart. Several properties of such operators and related quantities are given by the 
following two propositions. 

Proposition. If assumptions Al and A2 are met, the operator I k and the continuous partial derivative D a are 
Hilbert-Schmidt operators from H to L 2 (X, p x ), and 
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Proposition. If assumptions Al and A2 are met, the sampling operator S and the empirical partial derivative 
D a are Hilbert-Schmidt operators from H. to R™, and 

n 1 71 

S*v = - } k Xt Vi, S*Sf = -}k Xi (f,k Xi )n 

i=l i=l 
-, n 1 n 

D* a v = -y2(d a k) Xi vi, b* a D b f = -y2(d a k) Xi (f,(d b k) Xi ) n 

71 ' ' 71 ' ' 



n *■ — ' n 

i=l i=l 



where a, b = 1, . . . , d. 

The proof can be found in l22l for Ik and S, where assumption Al is used. The proof for D a and D a 
is based on the same tools and on assumption A2. Furthermore, a similar result can be obtained for the 
continuous and empirical gradient 

V :n^(L 2 {X,p x )) d 1 Vf = (DJ) d a=1 

\7:H^(R n ) d , W=(Aj)Li, 

which can be shown to be Hilbert-Schmidt operators from H to (L 2 (X, px)) d and from H to (K n ) d , re- 
spectively. 



We next restate Propo sitio n 4.3 in a slightly more abstract form and give its proof. 
Proposition [Proposition 4.3 Extended] Theminimizer of |t} satisfies f T e Range(S*) + Range(\7*). Hence- 
forth it satisfies the following representer theorem 



n d 



f- = S*a + V*/3 = V - ai k Xi + V V -f3 a ,i(d a k) Xi (37) 

Z ✓ 7 7 Z ✓ Z ✓ 77 



n ^ — ' * — ' n 

2—1 i=l a— 1 



anffc a£l" and f3 G M nd . 

Proof. Being Range(S'*) + Range(V*) a closed subspace of "H, we can write any function / e Mas 
/ = f 11 + /\ where /// e Range + Range(V*) and (/- L ,5)« for all g € Range (S*) + Range(V*). 
Now if we plug the decomposition / = /" + f 1 - in the variational problem we obtain 

r= argmin {£(///) + 2r0f (///)+ + 

f=f"+f ± k J 



which is clearly minimized by J 1 - = 0. The second equality in 1 37 1 then derives directly from definition 

of S* and V*. □ 

We conclude with the following example on how to compute derivatives and related quantities for 
the Gaussian Kernel. 



Example 1. Note that all the terms involved in (33 1 are explictly computable. As an example we show how to 

_n*- S |i 2 

compute them when k(x, s) = e 2 ~< 2 is the gaussian kernel on R d . By definition 

dk(s, •) 



(d a k) Xi (x) 



ds a 

Given x e X it holds 
dk(s, x) 



i k.. 



x/n- 

S—Xi 



= e ■ => (d a k) Xi (x) = e 



ds a V 7 2 / V 7 2 
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Moreover, as we mentioned above, the computation of/3^ i and a\ requires the knowledge of matrices K, Z , Z, L a . 
Also their entries are easily found starting from the kernel and the training points. We only show how the entries 
of Zand L a look like. Using the previous computations we immediately get 



„a _ a 



7 2 



[Za]ij = e 

In order to compute L a we need the second partial derivatives of the kernel: 

II 2 a a b b 



dk(s, x) 
dx b ds a 



so that 



-e-^-r^-^r) ifa = b. 



if a ^ b 



B Proofs of Section H] 



-e 2 -r 



In this appendix we collect the proofs related to the derivation of the iterative procedure given in Algo- 
rithm[T| Theorem[T|is a consequence of the general results about convergence of accelerated and inexact 
FB-splitting algorithms in ||56| . In that paper it is shown that inexact schemes converge only when the 
errors in the computation of the proximity operator are of a suitable type and satisfy a sufficiently fast 
decay condition. We first introduce the notion of admissible approximations. 

Definition 2. Let e > and A > 0. We say that h G W is an approximation of prox A ^r, (/) with e-precision 
and we write h ~ e prox A ^ (/), if and only if 

tzhed^afih), (38) 

A 2A 

where denotes the e-subdijferential^^ 

We will need the following results from 



Theorem 6. Consider the following inexact version of the accelerated FB-algorithm in 1 22 1 with c\,t and c 2 .t as 
in p4) 

1 



f ~ £t prox^ ? [[I - — VF) (ci.tf- 1 + C2, t f-')) ■ (39) 
Then, if V ~ \/t l with I > 3/2, there exists a constant C > such that 

Proposition. Suppose that Q : % — > K U {+00} can fee written as Q = uj o B, where B : % —¥ Q is a linear 
and bounded operator between Hilbert spaces and lo : Q — > K U {+00} is a one-homogeneous function such that 
S := 9w(0) is bounded. Then for any f £% and any v € 5 swcft f/iaf 

2Xu(Bf) - 2{XB*v,f) < e 2 

it holds 

f - \B*v ~ e prox A ^(/). 



12 Recall that the e-subdifferential d e of a convex functional £7 : H — > K U {+00} is defined as the set 

&fi(/) := {h £% : Sl(g) - > (ft, 3 - />„ - e, V 5 G «}, V/ G W. 
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Proof of Theorem^ Since the the regularizer Of can be written as a composition of ui o B, with _B = V 



and w : E — >• [0, +00), = X) a =i ll^alU Proposition |B| applied with A = t/o, ensures that each 
sequence of the type \7*v q which meets the condition | |28| | generates, vi a ([29) , admissible approximations 
of proXjr^i?. Therefore, if issuchthats* ~ 1/t^with/ > 3/2, Theorem 6 limplies that the inexact version 

of the FB-splitting algorithm in (29 1 shares the 1/t 2 convergence rate. Equation | [32) directly follows from 



the definition of strong convexity, 



y II/* - / t ii 2 < ^ T (/*)/2 + f r (r)/2 - s r (f/2 + r/2) < \{r{f) - £ T (.h) 



□ 



Proof of Proposition \4.6\ We first show that the matrices K, Z OJ L a defined in |9]l,l(l0), and ( |rT| |, are the ma- 
trices associated to the operators SS* : E n -> E™, SX>* : R n ^ M" and D a V* : R nd -t R n , respectively. 
For K, the proof is trivial and derives directly from the definition of adjoint of S- see Proposition [X] For 
Z a and Z, from the definition of D* we have that 

^ n n 

(sT>*a) = -y^«j(^gfe)a^(^i) = y^(Zg)i,jaj = (Z a a)i, 
3=1 3=1 

so that SV*/3 = Ea=i ^a(/3a,i)"=i = E«=i z a(A.,i)?=i = Z£ For L a , we first observe that 



{(d a k) x , (d b k) x >) 



H 



d(d b k) x ,(t) 



dt a 



d 2 k{s,t) 



dt a ds b 



so that operator D a Dl : E n — > E™ is given by 

1 ™ 

((-Da-Db)"). = ({dak) Xi ,D* b v) n = - ^2{(d a k) Xi ,(dbk) X:j } n Vj = (K.^.^Vj 

71 3=1 

for i = 1, . . . , n, for aE v e R". Then, since D a V*f3 = J2t=i D a Dl (0 a ,i)f=i, we have that L a is the matrix 
associated to the operator D a V* : R nd -> E™, that is 

n e£ 

(^*j9)<=x;xI(l. ) 6)«a )J - ) 

3=1 6=1 

for i = 1, . . . , n, for all /3 G To prove equation f33| l, first note that, as we have done in Proposition 



4.3 extended, p3) can be equivalently rewritten as /* = S*a* + V*/3*. We now proceed by induction. 
The base case, namely the representation for t = and f = 1, is clear. Then, by the inductive hypothesis 
we have that J*" 1 = S^a*" 1 + V*^" 1 , and f- 2 = S*a*- 2 + V*/?*" 2 so that /* = S*a f + V*/3* with a* 
and /3* defined by ( (13) and | (T4) . Therefore, using (22 1, (29 1, (25 1 it follows that /* can be expressed as: 



(I-TrxeJ IS*! ( 1 - — ) a 4 - i (W + Z? -y) ) + (l - -) V*/3* 



and the proposition is proved, letting a*, /?* and «* as in Equations ([15}, 1 17 1 and | |26) . 
For the projection, we first observe that operator -DaS** : E n — > M. n is given by 



^ n n 

(D a S*a)i = ((S*a, (d a k) Xi ) n ) = - ^2a j (d a k) x .(x J ) = ^(Z,)^ = Z^a. 



3=1 



3=1 



Then, we can plug the representation d37b in 1 30 1 to obtain 1116). 



□ 
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Proof of Proposition \4.6\ Since f T is the unique minimizer of the functional £ T , it satisfies the Euler equa- 
tion for £ T 

o e a(£(.r) + 2rnf oh + rvfrwi). 

where, for an arbitrary A > 0, the subdifferential of Afif at / is given by 

d\h? (/) ={v*«,«=(«„)2 =1 G(R n ) d | Ua = aaj/HAJIU if iiAJIU >o, 

and ||w a ||n < A otherwise, Va = 1, . . .,d} 

Using the above characterization and the fact that £ + ru\\-\\^ is differentiate, the Euler equation is 
equivalent to 

V*v = -±V(£ + rv\\-f H ){f T l 
for any a > , and for some v = (v a )a=i e ( Rn ) d witn v = ( v a)t =i € (R n ) d such that 

«a = - „ f^r. if Pa/ T ||n>0, 
T 

w a e — B n otherwise. 
a 

In order to prove ( [34) , we proceed by contradiction and assume that ||-D a / T || n > 0. This would imply 
\\v a \\n = t/ct, which contradicts the assumption, hence ||-D / r ||„ =0. 

We now prove (35 1. First, according to Definition [2] (see also Theorem 4.3 in Il56l and O for the case 
when the proximity operator is evaluated exactly), the algorithm generates by construction sequences 
/* and /* such that 



~f ~f- ^VF(/ 4 ) G -B^ItQ® (/«) = ^% (£f)2 f2f (/'). 

where d e denotes the e-subdifferentiaj^] Plugging the definition of /' from ( (29} in the above equation, 
we obtain V*u* G ^d^( £ tyQf(f t ). Now, we can use a kind of transportation formula [34] for the e- 
subdifferential to find e* such that V*u* G ^d^^t^h® (f T ). By definition of e-subdifferential: 

nf (/) - nf (/ ( ) > (-vv, / - /% - f (e 4 ) 2 , v/ g 

r 2r 
Adding and subtracting £lf (f T ) and (^V*w*, / r ) to the previous inequality we obtain 

nf (/) - (r) > <-w,/ - r)« - f (e*) 2 , 

with 

(e*)3 = ( £ *) 2 + ^(fif (/* ) - fif (/>)) + ( 2 W,.f - 



From the previous equation, using | [32| | we have 

(£') 2 = (e') 2 + 



which implies ^ V*w* G d a (s%yi i^J^x Q 1 ")' Now, relying on the structure of fif, it is easy to see that 

deft? {/) C {V*u,w = K)^ =1 e(M") d I |K||„ > 1 - if ||A,/||„>0}. 

Thus, if ||AJ T ||n > Owe have ||v*|| n > £(1 - 2|| jff r|| J - □ 



13 Recall that the e-subdifferential, <9 E , of a convex functional O, : ft — > R U {+00} is defined as the set 

9«n(/) - {hen ■. fi( s ) - n(/) >(ft,s-/)K-e, v 9 e«}, v/ e ft. 
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C Proofs of Section [5] 

We start proving the following preliminary probabilistic inequalities. 

Lemma 1. For < rji , 772 , V3 1 Vi < 1/ n £ N, zf /zoZds 

xy 2/ 2 <^p(^, 2/) < e(n,r?i)J >l-?yi with e(n,7?i) = 

(2) P(||S*y-7]E/p||w< cCn,^)) >l-jj2 with e(n, m )-- 

(3) P(||5*5-^l fe || <e(n, %))>!-% with e(n, m )-- 

(4) p(\\D* a D a - D* a D a \\ <e(n, V4 ))>l- m with e(n,r7 4 ) = 



: -^M 2 log—, 

:^^KlMl0g— , 

V" ??2 

2\/2 9l 2 

= — lOg—, 



2\/2 2 . 



2 

'/4 



Proof. From standard concentration inequalities for Hilbert space valued random variables - see for 
example [47|- we have that, if £ is a random variable with values in a Hilbert space % bounded by L 
and £1 , . . . , £„ are n i.i.d. samples, then 

\\lj2$i-m)\\<<n,r 1 ) = ^§Llog- 
n v n r\ 



with probability at least 1 — 77, rj g [0,1]. The proof is a direct application of the above inequalities to the 
random variables, 

(1) £=y 2 (£ 1 with sup Zn ||£|| < M 2 , 

(2) \=k x y £e H(g>R with sup z " \\l\\ < n x M, 

(3) ^={-,k x ) n k x £e HS(H) with su Pz J|£||« s (H) < k\, 

(4) £ =<•, {d a k) x ) u {d a k) x £ e ftS(ft) with sup Zn ||£||«s(H) < «§. 

where HS(H), \\-\\hs(H) are tne space of Hilbert-Schmidt operators on % and the corresponding norm, 
respectively (note that in the final bound we upper-bound the operator norm by the Hilbert-Schmidt 
norm). 



□ 



Proofs of the Consistency of the Regularizer. We restate Theorem[2]in an extended form. 
Theorem [Theorem[2]Extended] Let r < 00, then under assumption (A2), for any 77 > 0, 



2^2 



2d 



p( 5 up |nf(/)-n?a)l>rd^«2jiog-j <n 



\\f\\n<r 



(41) 



Consequently 



limP sup |0f (/) - fi?(/)| > e =0, Ve>0. 



11/11 H< r 



2 



1/2 



Proof. For f E W consider the following chain of inequalities, 

d 

|n?(/)-n?(/)|<£|0a/||„-P«/| 
0=1 

d 

<£(|||A»/|£-||A»/| 
o=l 
d 

= ^(|(/, (t>* a D a -D a *D a )f) n 

a=l 
d 

<Y,\\D* a D a -D a *D a mf\\ n , 



1/2 
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that follows from from \sfx — Jy\ < y/\x — y\, the definition of D a , D a and basic inequalities. Then, 
using d times inequality (d) in Lemma [T| with r\jd in place of 774, and taking the supremum on / € % 
such that \\f\\u < T , we have with probability 1 — rj, 



sup \h?(f)-nV(f)\<rd^ K Jlog-. 
\\f\\n<r («) 1/4 V * 

The last statement of the theorem follows easily. □ 

Consistency Proofs. To prove Theorem|3j we need the following lemma. 
Lemma 2. Let 77 e (0, 1]. Under assumptions Al and A3, we have 

sup \£{f) - £{f)\ < ^2 {k\t 2 + 1k x Mt + M 2 ) log -, 

H;if/j probabilty 1 — 77. 

Proof. Recalling the definition of Ik we have that, 

£(/)= / (hf(x)-y) 2 dp(x,y) 
Jxxy 

= f (I k f(x)) 2 d Px (x)+ f y 2 dp(x,y)-2 f I k f{x)ydp{x,y) 
Jx Jxxy Jxxy 

(hf(x)) 2 dp x (x)+ f y 2 dp(x,y)-2 f hf(x)f p (x)d Px (x) 
x Jxxy Jx 



= (f,W)-H + / y A dp(x,y) - 2{f,I* k f p ) n . 
J xxy 

Similarly £{f) = (/, S*Sf) n + \\y\\ 2 n - 2(f, S*f p ) H . Then, for all fen, we have the bound 

|f(/)-f(/)|<||5*5-^7 fc ||||/||^ + 2||5*y-4*/ /3 ||w||/||K+ ||y||Jl - / y 2 dp(x,y) 

J xxy 

The proof follows applying Lemma[T]with probabilities 771 = 772 = 773 = ?y/3. □ 

We now prove Theorem 3] We use the following standard result in regularization theory (see for 
example [26 J) to control the the approximation error. 

Proposition. Let t„ — » 0, be a positive sequence. Then we have that 

£ Tn (f Tn ) - inf £(f) -> 0. 
w ; fen K ' 

Proof 'of Theorem^ We recall the standard sample /approximation error decomposition 

£{f) mf £(/) < \£{h-£ T {f)\ + \£ T {f) - inf £{f)\ (42) 

where - £(f) + 2rfi?(/) + ri/||/|&. 

We first consider the sample error. Toward this end, we note that 

HI/1?/. < £ T (h < £ T (o) = \\yf n => 11/1* < % < ''' 
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and similarly \\r\\ n < {^y^dpf/ 2 /^ < JL. 
We have the following bound, 

£(f T ) - £ T (f T ) < (£(f T ) - £(h) + £(h ~ £ T (f T ) 

< {£{h - £{h) + £ T ih - £ T in 

< (£(h - £(h) + £ T {f T ) - £ T (f T ) 

< (£(h - £(fl) + {£{fl - £{D) + r(^(f T ) - ^?(/ r )) 
<2 sup \£{f)-£{f)\ + r sup |fif(/)-fi?(/)|. 

II/II«<^S ll/ll«<^j 

Let 7/ e (0, 1]. Using Lemma [2] with probability 77 = 3r// (3 + d), and inequality ((41) with 77 = dt]'/(3 + d), 
and if 77' is sufficiently small we obtain 

£(f T )-£ T (f T )< — + + 1 log — + t— — d^« 2 J log — . 

with probability 1 — 77'. Furthermore, we have the bound 

Mk\ T 1 / 2 dK 2 \. 6 + 2d 



fr) _ f r (r) < c ^, +^7= Mog^r- (43) 
\77 i /^Ti/ n^l^^/v ) rf 

where c does not depend on n, r, i/, cL The proof follows, if we plug (j43j in | [42) and take r = t„ such 
that T n —} and {T n -\/n)~^ — > 0, since the approximation error goes to zero (using Proposition |c| and 
the sample error goes to zero in probability as n —> 00 by ||43|. 

□ 

We next consider convergence in the RKHS norm. The following result on the convergence of the 
approximation error is standard [26 J. 

Proposition. Let r„ — > 0, be a positive sequence. Then we have that 

We can now prove Theorem |4] The main difficulty is to control the sample error in the "H-norm. This 
requires showing that controlling the distance between the minima of two functionals, we can control 
the distance between their minimizers. Towards this end it is critical to use the results in |55] based on 
Attouch-Wetts convergence. We need to recall some useful quantities. Given two subsets A and B in a 
metric space (H, d), the excess of A on B is defined as e(A, B) := supj eA d(f, B), with the convention 
that e(0,-B) = for every B. Localizing the definition of the excess we get the quantity e r (A, B) : 
e(A n B(0, r), B) for each ball B(0, r) of radius r centered at the origin. The r-epi-distance between two 
subsets A and B of H, is denoted by d r (A, B) and is defined as 

d r (A, B) := max{e r (A, B),e r (B, A)}. 

The notion of epi-distance can be extended to any two functionals F, G : % — > K by 

d r (G, F) := d r (epi(G),epi(F)), 

where for any F : H —> M, epi(F) denotes the epigraph of F defined as 

epi(F) :={(f,a),F(f)<a}. 

We are now ready to prove Theorem|4j which we present here in an extended form. 
Theorem [Theorem|l]Extended] Under assumptions Al, A2 and A3, 

P( ||f - fl\\u > A(n,r)^ + ||r - f}\\n,) < V (44) 
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where 

, fK, r ( 4k?M 4kx 1 2dn 2 \ 

A(n, t) = 4V2M -J^ + 1 + _- + 2 

for < 77 < 1. Moreover, 

limP(||/ r --/t||«>e)=0, Ve>0, 
/or any r n smc/z f/iaf t„ — > and (nA"" 2 ) -1 — > 0. 

Proof of Theorem^ We consider the decomposition of ||/ T — /^||<h mt o a sample and approximation 
term, 

\\r - fly < wr - r\\n + \\r - flu- (45) 

From Theorem 2.6 in 155 1 we have that 



r T u(\\r - ru) < M M/V ^(t £ ,£\t s ,£ r ) 

where il>% v {t) := inf{^s 2 + \t — s\ : s € [0, +oo)}, and t £ r is the translation map defined as 

t £ rG(f) = G(f+n-£ T (n 

for all G : H -> R. 

From Theorem 2.7 in [55J, we have that 

d M/ ^{t £ r£\t £ r£ T ) < SUP \t £ r£ T (f)-t £ r£ T (f)\. 

\\f\\ n <M/^FU 

We have the bound, 

sup \t £ r£ T (f)-t £ rr(f)\ < sup \r(f)-r(f)\ 

\\f\\ n <M/V¥V \\f\\n<M/SFZ+\\r\\ H 

< sup \£(f)-£(f)\+r sup |fi?(/)-nf(/)|. 

||/||„<2M/ V AF]7 \\f\\ n <2M/^ 

Using Theorem [2] (equation |4TJ) and Lemma[2]we obtain with probability 1 — rj', if i]' is small enough, 

, „_ \ 2 V2 ( 2 4M 2 A M 2 , r2 \, 6 + 2d 2M 2y/2 f 6 + 2d 
d M/V¥U (t £ ,£ T ,t £ ,£ T ) < U 2 +4 Kl -=+M 2 log + T d-^KiJiog — 

< 2v^M + + ^+r ;_ log — . (46) 



From the definition of ij)% v it is possible to see that we can write explicitly (tp%) 1 as 

«„r 1 (») = <v™ if f < ^ 

{y + if- otherwise. 



Since r = r„ — > by assumption, for sufficiently large n, the bound in (46 1 is smaller than 1/2tv, and 
we obtain that with probability 1 — r}', 

1 /nT z iy z JtltvJtv Jrvrv n L i^V\/Tv 



If we now plug ||47j in ||45j we obtain the first part of the proof. The rest of the proof follows by taking 
the limit n — > 00, and by observing that, if one chooses t = r„ such that r„ — > and (r^y/n)^ 1 — > 0, the 
assumption of Proposition [C] is satisfied and the bound in | [47) goes to 0, so that the limit of the sum of 
the sample and approximation terms goes to 0. 

□ 
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Proofs of the Selection properties. In order to prove our main selection result, we will need the 
following lemma. 

Lemma 3. Under assumptions Al, A2 and A3 and defining A(n, t) as in Theorem^extended, we have, for all 
a = 1, .... d and for all e > 0, 

P(\\\D a r\\ 2 n-\\DafX x \>t) <( 6 + 2d ) eX p(-^^)' 



where a(n, r) — 2 max{ \^ TV 2 , In^Ain, r)} and \im T ^ b(r) = 0. 
Proof. We have the following set of inequalities 



\\Daf 



T\\1 



\D a fX x = \(r,D* a Daf T )n - (flD* a D a fl) n + 
(r,D* a D a h n - (r,D* a D a h n + 
(ft,D* a D a f) H - (fl,D* a D a h n \ 

= (/\ (D* a D a - D* a D a )r)n + (F - f P ,D* a D a (r - fj,))u 

- - M 2 . 

<\\ir a D a -ir a D a \\— + l $\\r-ft\& 

M 2 

< \\D* a D a - D* a D a \\— + 24\\r ~r\\ 2 n + f%.- 



Using Theorem [4] extended, equation (47) , and Lemma[T]with probability 774 = r;/(3 + d), we obtain 
with probability 1 — 77 



lA,rii;-PJ2ii^ 

We can further write 



< 



,og «±* + 2^(„, r, log «±*i + 2«=||r - /{ft. 



\Daf T \\l \\D a flt px < a(n, r) log + 6(r) 



where a(n,r) = 2 max{ 2v ^'^ h ' 2 , 2n^A{n, r)} and lim T ^ ^( r ) = according to PropositionC 



follows by writing e = a(n, r) log 

Finally we can prove Theorem [5] 
Proof of Theorem^ We have 



S+2d 

n 



b(r) and inverting it with respect to r\. 



The proof 

□ 



P(i? p Ci? r ) =1-P(^pj = 1_P ( U { a ^^ T }] - £ P(a<M T 

Let us now estimate P^a ^ R T ^j or equivalently P^ae .R r ^ = P^ \\D a f T \\l > 0), for a e R p . Let 
C < mm ae R p \\D a fJ,\\ 2 x . From Lemma|3j there exist a(n,r) and 6(r) satisfying lim T ^ &(t) = 0, such 



that 



\\D a f. 



t||2 

P II PA/ 



< e 



with probability 1 — (6 + 2d)exp(— )/ for all a = 1, . . . , d. Therefore, for e = C, for a e i? p , it holds 

\\Dah\ 2 n>\\D a fX X - C ^°- 
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with probability 1 - (6 + 2d)exp(-^£^) . We than have 

P(aeir) =P(\\D a f\\l >0) >l-(6 + 2rf)cxp(^- C a ~ b y 

so that P^a ^ < (6 + 2d)exp ^— C a ^^ ) • Finally, if we let r = r„ satisfying the assumption, we 
have lim„ 6(r„) — > 0, lim„ a(n, r„) — > 0, so that 



lim P R„ C R T ") > lim 



1 - |i2 p |(6 + 2d)exp(- 



C - c(r n ) 



1 - |i? p |(6 + 2d) lim cxp 



a(n,r„) 
C - b(r n ) 
a(n,r n ) 



= 1. 



□ 
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Table 3: List of symbols and notations 



Spaces and distribu- 
tions 


X C E d 


input space 




output space 


p 


probability distribution on A" x ^ 


Px 


marginal distribution of p 


T?(X n v \ 


{/ : A" — S- E : measurable and s.t. f f{x) 2 dpx{x) < +00} 


14 


RKHS C {/ : X -> J} 


Norms and scalar 
products 


II • lln and (■,■)„ 

ii 1 1 n \ ' / n 


^= ■ euclidean norm and scalar product 


Hp* and (;-) px 


norm and scalar product in L 2 (X, px) 


\\-\\n and (•,•>„ 


norm and scalar product in H 


Functionals and Op- 
erators 


o?p - 7-/ ^ rn +oo1 


n?(/) = E V 

a=l V 


'f x (Wdp*(x)y 


Of : H —J- [0,+oo) 


"?(/) = E \ 
a=i y 


1 f (efM\ 2 

n ' ^ Q x a J 


£:U^f [0,+oo) 


S(f) = f x (f(x 


) - yf dp{x,y) 


£ T [0, +oo) 


£ T (f) = f x (f( 


x)-y) 2 dp(x,y) + T(2n»(f) + v\\f\\ 2 H ) 


E-.U^t [0,+oo) 


n 

£(/) = £s(/(*0-iK) a 


£ T [0, +oo) 


£ T (/) = E - y0 2 + r(2flf (/) + v\\f\\ 2 H ) 

1=1 


K-.U^t L 2 (X , p x ) 


(/*/)(*) = (/, 


k x )u 


S : H -»■ K n 


5/ - (/(xi),. 


■,f(Xn)) 


D a :n^ L 2 (X,p x ) 


= (/, (0«fc)*> 




Da(f) = (&(x 1 ),...,£t i (x n )) 


V:H^(L 2 (X,px)) d 


V/ = (£>«/)„= 


1 


V : H -> (K") d 


V/ = (£ a /)f= 


1 


Functions 


: A" ->• R 






argmin /eargmin 




(<9 a fc) x : A" -> K 


. 8fc(«,t) 




the minimizer in % of E T 


r 


the minimizer in % of £ T 


Sets 




{a£ {l,...d} 




i? T 


{ae{l,...d} 




Bn 


{v£R n : |M| n < 1} 




{v£R n : \\v\\ n <l} d 
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